ABSTRACT 

A central problem for the user of domain-referenced 
tests in instruction is deciding who has passed and who has failed. 
Two procedures were presented and discussed. The first, employing 
classical test theory, was found to be more useful for larger domains 
and where the passing standard is 70 percent or less. The sampling 
procedure suggested by Millman (1974) was found to be more applicable 
when the test size approximates the size of the domain. Neither 
procedure appears useful when the passing standard is high. In light 
of the large numbers of examinees classified as uncertain when real 
test data are used, it was concluded that neither procedure offers 
much to decisionmaking in systematic individualized instruction. 
(Author) 






One problem in systematic individualized instruction is determining 
which students have passed or failed a test representing an achievement 
domain. It has been widely advocated that a passing standard (PS) be set 
for institutional situations and that tests be constructed which carefully 
correspond to instructional intent (Millman, 1974b; Hambleton and Novick, 
1973). Those examinees whose scores fall at or near the PS are in jeopardy 
of being misclassified due to errors of measurement. One solution to this 
problem is to set a confidence interval around the PS and to make decisions 
based on how each examinee scores with respect to that confidence interval. 
Those who score above the confidence interval pass, those who score below 
fail, and those who score within the confidence interval are given remedial 
instruction or further testing until their status is ascertained. 
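
The three-way decision rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the percent scale, and the example half-width of 10 points are assumptions made for the sketch, not values taken from the study.

```python
def classify(score_pct, ps_pct, half_width_pct):
    """Three-way decision around a passing standard (all values in percent).

    half_width_pct is the half-width of the confidence interval set
    around the PS; how it is obtained depends on the approach used.
    """
    lower = ps_pct - half_width_pct
    upper = ps_pct + half_width_pct
    if score_pct > upper:
        return "pass"
    if score_pct < lower:
        return "fail"
    # Scores inside the interval call for remediation or further testing.
    return "uncertain"


# Hypothetical scores judged against a 70% PS with a 10-point half-width.
for score in (55, 68, 85):
    print(score, classify(score, ps_pct=70, half_width_pct=10))
```

Everything that follows in the analysis turns on how wide the interval is, since that width determines how many examinees land in the "uncertain" category.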

The statistical estimation of confidence intervals can be done in a 
number of ways, depending upon the theoretical orientation. This particular 
study is limited to two sharply contrasting approaches: the first is a 
procedure from classical test theory, and the second is an item sampling 
technique recently suggested by Millman (1974a). 

To begin this analysis, it is necessary to provide a useful definition 
of a domain-referenced test (DRT) and then briefly describe the instructional 
context for which the decisionmaking models are advocated. Then the two 
procedures are examined in light of some fundamental operations in 
scientific inquiry. 

Defining the Construct. Hively (1974, p. 8) has described a domain as 
"any specified set of items." Millman (1974a) defines a DRT as a random 
sample of items from a domain. However, these brief definitions deserve 
more attention. It is clear from Millman's extensive treatment of DRT theory 
(1974b) that the use of instructional objectives was not intended to be the 
device for the careful specification of a set of items for a domain. What 
is advocated are item writing rules in the spirit of Hively, Patterson, 
and Page (1968) and Bormuth (1970). However, the technology for such item 
generation is neonatal. In the present context, a domain will be considered 
any set of items that is conceptually related to an instructional unit or 
intent. The unitary nature of the set of items is a defining trait of the 
domain, and any random sample of items is a DRT. The domain may be infinite 
in size, or it may consist of only several items. The latter instance is 
viewed as extremely unlikely in modern systematic instruction where items 
and item pools are numerous. 

The Instructional Context. The current press for individualized 
instruction has led to many types of systematic instruction. Some of these 
are Bloom's mastery learning (1968), Individually Prescribed Instruction 
(IPI), and Program for Learning in Accordance with Needs (PLAN). The 
latter two were recently reviewed by Hambleton (1975). Regardless of the 
specific instructional system employed, most systems allow time-to-learn 
to vary with the individual, and most advocate the use of frequent testing, 
usually prior to and following instruction. Thus a PS is required, and 
students must ultimately be assigned to a pass or fail category. Although 
this analysis is focused on the decisionmaking issue, the problem of where 
to set the PS is inextricably connected with it. 

Some Fundamental Operations in Scientific Inquiry. According to Kuhn 
(1962), science often advances with scientific revolutions. The current 
trend away from classical test theory and toward new approaches to classroom 
achievement testing may qualify as a departure from "normal science." The 
test of the two approaches from a theory viewpoint is couched in logical 
and statistical criteria. One is initially concerned with the inferences 
drawn from each approach and the generality of an approach to the wide array 
of achievement testing situations common to systematic individualized 
instruction. Later, the theory is tested to see if data fit the model. The 
interface between data and theory is a necessary condition in theory 
verification (Kaplan, 1964). In the present context, two rival hypotheses 
are examined both theoretically and empirically. Classical theory has been 
rejected by many "new theory" advocates; thus the discussion of classical 
theory focuses on these criticisms and the nature and scope of classical 
theory, while the item sampling approach is presented as a direct solution 
to DRT construction and use. 

Two Approaches to Decisionmaking 

Regardless of the approach considered, the crux of the problem in 
decisionmaking in this instructional setting is that of knowing about true 
scores. In classical theory, a true score is the expected observed score. It 
is estimated from the product of the reliability estimate and the standardized 
observed score; the result is a regressed true score estimate. In sampling 
theory, the observed score is considered to be an unbiased estimator of the 
domain score (analogous to the true score). In other approaches (e.g., Rasch 
models, the Bayesian approach, and Cronbach's theory of generalizability), the 
true scores are conceptualized differently. With each approach, the standard 
error (SE) may vary. Thus it is held that the procedure that yields the 
smaller SE for a wide variety of test situations is probably most effective 
for decisionmaking. Analogously, when classical reliability is estimated 
in two different ways (KR-20 and KR-21), the latter is an underestimate, 
which leads to overestimated SE's. In the same respect, various approaches 
lead to estimates of error which may be too large to be useful in instructional 
decisionmaking. If an approach leads to assigning most students to a failing 
or uncertain status, one has to question the usefulness of the model, 
not on logical or statistical criteria, but on empirical grounds. 
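
To make the relation between the two reliability estimates and the SE concrete, the following Python sketch computes KR-20, KR-21, the resulting standard errors, and a regressed true score estimate using the standard textbook formulas. The response matrix `RESPONSES` and all function names are hypothetical and chosen only for illustration; none of the numbers come from the data analyzed in this paper.

```python
from math import sqrt
from statistics import mean, pvariance

# Hypothetical 0/1 item scores: six examinees (rows) by five items (columns).
RESPONSES = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def kr20(items):
    """KR-20: uses each item's own difficulty p_i."""
    k = len(items[0])
    totals = [sum(row) for row in items]
    p = [mean(row[i] for row in items) for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))


def kr21(items):
    """KR-21: treats all items as equally difficult, so it cannot exceed KR-20."""
    k = len(items[0])
    totals = [sum(row) for row in items]
    m = mean(totals)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * pvariance(totals)))


def standard_error(items, reliability):
    """Classical SE of measurement: SD of total scores times sqrt(1 - r)."""
    totals = [sum(row) for row in items]
    return sqrt(pvariance(totals)) * sqrt(1 - reliability)


def regressed_true_score(x, group_mean, reliability):
    """Observed score pulled toward the group mean in proportion to reliability."""
    return group_mean + reliability * (x - group_mean)


r20, r21 = kr20(RESPONSES), kr21(RESPONSES)
print("KR-20:", round(r20, 3), "SE:", round(standard_error(RESPONSES, r20), 3))
print("KR-21:", round(r21, 3), "SE:", round(standard_error(RESPONSES, r21), 3))
```

Because KR-21 replaces the individual item variances with their upper bound, its reliability estimate is the lower of the two and its SE the larger, which is the overestimation referred to above.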

Ideally, the rationale for setting a PS should be one of predictive 
validity. Those scoring above the PS are very likely to be successful in 
another unit of instruction or on a job. Those not passing have little 
likelihood of future success in the next instructional unit or job. In this 
ideal situation, the distribution of test scores is bimodal, with noninstructed 
students scoring at the floor of the scale and instructed students scoring 
at the ceiling. With the PS set to minimize the errors of misclassification, 
an approach to setting confidence intervals bears importantly on 
decisionmaking. This might lead to a suspicion that DR tests might have the 
optimal PS at the midpoint of the achievement scale. 

The Classical Approach. With dichotomously scorable items, the SE is 
computed using a KR-20 estimate of reliability. In classical theory, the 
SE does not apply to error surrounding the observed scores. Instead, it 
refers to the distribution of observed scores around a true score. Taking 
the PS as a point on the scale where a true score must exist in order to 
justify a passing status, those persons whose true scores fall at the PS 
will have observed scores within plus or minus two SE's of the PS about 95% 
of the time. To minimize errors of misclassification, we either continue 
instruction or provide specific remedial instruction as determined from 
subscale scores. In other words, we attempt to change the student's status 
positively so an accurate assignment of "pass" can be made. 
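
As a rough sketch of the classical band, the limits of PS plus or minus two SE's can be computed in raw-score units as below. The function name `classical_band` and the example values (a 45-item test, a 70% PS, and an SE of 2.9 raw-score points) are hypothetical and serve only to show the arithmetic.

```python
def classical_band(n_items, ps, sem, z=2.0):
    """Return (lower, upper) raw-score limits of the band PS +/- z * SE.

    ps is the passing standard as a proportion of items correct;
    sem is the classical standard error in raw-score units.
    """
    cut = ps * n_items
    return cut - z * sem, cut + z * sem


# Hypothetical case: 45 items, PS of 70%, SE of 2.9 raw-score points.
lower, upper = classical_band(n_items=45, ps=0.70, sem=2.9)
print("uncertain between", round(lower, 1), "and", round(upper, 1), "items correct")
```

Note that the width of this band depends only on the SE and not on where the PS is placed on the scale.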



Critics of classical theory for such testing have maintained that 
classical test theory is a norm-referenced (NR) approach to measurement 
and does not yield the type of information required in a criterion- 
referenced (CR) situation. According to this argument (e.g., Popham and 
Husek, 1969; Carver, 1974), a NR test theory yields information about the 
relative differences among examinees, whereas CR tests yield information 
about the percentage of tasks (test items) that a student can do (answer 
correctly) from a well-defined universe of tasks. Donlon (1974) and 
Millman (1974b) have discussed the semantic difficulties of the terms CR 
and DR. The problem with definitions of the concepts has made the study of 
DRT's more difficult. 

It has been popularly held that classical test theory leads to NR test 
interpretations, while item sampling theory leads to DR test interpretations. 
However, there is evidence to dispute these beliefs. Nunnally (1967) has 
presented classical theory as a "domain sampling" model. Any test is "a 
random sample of items from a hypothetical domain of items" (Nunnally, 1967, 
p. 175). Lord and Novick (1968) have also defined classical theory as a 
random sampling procedure from a well-defined set of test items. The 
randomness in sampling items from the domain is quite explicit in theory, 
although admittedly seldom practiced. Thus it is the use of classical 
theory rather than the theory itself that appears faulty. 

Classical test scales yield two fundamental types of information, 
absolute and relative. The absolute information is seen as DR, and the 
relative information is NR. Donlon (1974), among others, has clarified this 
relationship and extended our understanding of the various test uses to a 
number of applications. As Ebel (1974) has stated, a test is a test. What 
we choose to do with the results determines the designation CR, DR, or NR. 
If a test is constructed in a manner specified in classical theory, it is 
held that NR or DR interpretations are possible. Based on this argument, 
classical theory is advocated as a useful approach to decisionmaking along 
with a constellation of other approaches, many of which were cited briefly 
earlier in this paper. 

The statistical aspects of decisionmaking have been criticized by Popham 
and Husek (1969) and later by others (e.g., Carver, 1974). The issue here 
is one of variance. Scores following instruction are said to be restricted 
to the degree that classical estimates of reliability are useless. Millman 
and Popham (1974) have also argued that variance is actually an irrelevant 
concept in CR or DR testing, since the measurement requirements involve only 
a person's status with respect to a well-defined domain of items. The 
suspicion that variance is reduced following instruction has not been 
empirically verified. In fact, when instruction is not as effective as one 
might hope, the opposite appears true; variance is quite substantial. Woodson 
(1974a, 1974b) has argued persuasively that the suspected lack of variance 
may be due to a restricted and inappropriate sampling of examinees. Since 
the test is calibrated to discriminate between instructed and noninstructed 
persons, items should be calibrated on the entire range of abilities. This 
was empirically substantiated with CR tests in one study by Haladyna (1974). 

However, the attention given to variance and reliability may be 
misdirected. As previously noted, test variance has much to do with the 
estimation of reliability but nothing to do with computing the SE. Reliability 
is only a device to gain information so a SE can be computed. Since the 
SE's are the important statistics in decisionmaking, and SE's are constant 
regardless of the sample tested, it would seem appropriate to compute a SE 
from any sample of examinees. 



To summarize, the essence of the logical-statistical rationale for using 
classical theory in decisionmaking in the DR context is that any achievement 
test is viewed simply as a means for measurement. What occurs following 
that measurement is viewed as "NR," "DR," or "CR." Strictly speaking, the 
definition offered by advocates of DRT theory is semantically identical to 
that of classical theory. In this respect the use of classical theory for 
the DR test use is entirely consistent. 

Item Sampling. The sampling approach for DRT's presented by Millman 
(1974) is an application of item sampling theory as described in Chapter 11 
of Lord and Novick (1968). The procedure calls for random samples of items 
for any test to be drawn from a well-defined domain of items. No restrictions 
are put on the domain, unlike classical theory where homogeneity is a useful 
concept. Some of the assumptions of the item sampling model are: (a) a domain 
is definable in terms of items which need not be conceptually or empirically 
homogeneous as a domain is in classical theory, (b) any examinee's score is 
an unbiased estimator of his domain score, (c) the score and the 
interpretation of the score are independent of any other examinee's score or 
of the qualities of the test (i.e., test variance, reliability, and item 
discrimination). In fact, Millman (1974b) has maintained that tampering with 
the items in a domain may limit or change the quality of the domain. Item 
analysis is thereby restricted to locating and discarding or revising 
defective items. In classical theory, one seeks items that measure a domain 
through item analysis or similar procedures. 

An uncertainty band (UB) is constructed which is conceptually analogous to 
the confidence interval in classical theory. Two UB's, like plus or minus 
two SE's, form a 95% confidence interval. It is interesting to note that 
a classical SE is independent of the PS, while the item sampling UB is 
dependent on the PS. One reason for the latter is that there is no 
accounting of the source of measurement error due to ambiguous or 
non-discriminating items. Therefore, one must conclude that the PS is set for 
a DRT to adjust for items which are variably discriminating. Further, the PS 
is adjusted upwards or downwards to compensate for the decreased or increased 
amount of measurement error arising from variable item discrimination indexes. 
Without this assumption, the item sampling procedure would not account for a 
source of error which is built into the classical model. 

Millman (1974a) has stated that those students falling in the uncertainty 
band should be given more test items until their status is determined. The 
administration of more items decreases the size of the UB so that more pass 
or fail assignments can be made. If a student has scored at or extremely 
close to the PS, the number of items needed could be inordinate. Rather 
than take longer tests, it might be advisable to offer remedial instruction 
based on subscale information from the test. However, in the classical 
approach, subscale information has been found to be quite unreliable (Haladyna, 
1974). 
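
The point about inordinate test lengths can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the band is two binomial standard errors of a proportion evaluated at the PS and that the domain is effectively unlimited; this is not necessarily the exact procedure Millman proposes, and the function name `items_needed` is hypothetical.

```python
from math import ceil, inf


def items_needed(observed, ps, z=2.0):
    """Approximate test length at which an observed proportion falls outside the band.

    Assumes a band of z binomial standard errors, sqrt(ps * (1 - ps) / n),
    around the PS and an effectively infinite domain (both are assumptions).
    """
    gap = abs(observed - ps)
    if gap == 0:
        return inf  # a score exactly at the PS can never be resolved
    return ceil(z ** 2 * ps * (1 - ps) / gap ** 2)


# A student at 90% is resolved quickly against a 70% PS; one at 72% is not.
print(items_needed(0.90, 0.70))
print(items_needed(0.72, 0.70))
```

Under these assumptions the required length grows with the inverse square of the distance from the PS, so a student two percentage points from the standard would need on the order of two thousand items.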
Empirical Aspect of the Analysis 

This part of the analysis begins with an application of item sampling 
to hypothetical situations. The tables constructed (see Table 1) are by 
no means exhaustive but are meant to be illustrative of a wide variety of 
instructional situations that are encountered in a great many individualized 
instruction systems. The second part is an application of both procedures 
to sets of achievement data which meet the requirements of a DRT. It is 
in the second phase that the criterion of effectiveness becomes crucial. 

In Table 1, UB's are presented for a variety of situations: where the 
domain size is unspecified, 1000, 500, 100, 50, or 25 items; where the test 
size is 5, 10, 20, 30, 50, 75, or 100 items; and where the PS is 50, 70, 90, 
or 99 percent. The UB is expressed on a percentage scale. Large UB's indicate 
the potential for ineffective decisionmaking, whereby too many students are 
assigned uncertain status and the confidence zone is too large with respect 
to the scale. 

Table 1 reveals that whenever the PS is high or the test size is 
extremely small, no one can be assigned a passing status. Therefore, the 
sampling procedure cannot be applied to these situations. The cutoff for 
this appears to be in the high 80%'s for most situations. That is, if a PS 
is higher than around 85%, passing status can seldom be assigned using the 
sampling plan. 

In situations where test size (n) is small, the UB is also quite large. 
This result is consistent with classical theory, where Haladyna (1974) 
reported low subscale reliabilities for CR tests. The rest of Table 1 serves 
to illustrate that when n approaches N and the PS is high, the UB's are very 
small. When the PS is low, between 50 and 70%, UB's are larger. For example, 
when a 50% PS is justified and a 30 item test is used to measure a large 
domain, the uncertainty region includes the range of scores from 32% to 68% 
inclusive. The degree to which this interval includes examinees is a matter 
of empirical determination. If a bimodal distribution exists representing 
the instructed and non-instructed groups, then such a model would lead to 
more effective decisionmaking. 

TABLE 1 

Uncertainty Bands as a Function of Domain Size (N), Test Size (n), 
and the Passing Standard (PS) 

When N is large (over 1000), which is not unusual in terms of present 
day item and objective banks, the UB's appear considerably larger relative 
to UB's for similar test lengths from smaller domains. Thus the UB appears 
most suited for domains of small size (N). 
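
The qualitative patterns described for Table 1 can be explored with a small sketch. It assumes, as an illustration only, that the uncertainty region is two binomial standard errors of a proportion evaluated at the PS, with a finite-population correction when the domain size N is specified; the table's actual entries and Millman's exact computations are not reproduced here, and the function names are hypothetical.

```python
from math import sqrt


def uncertainty_limits(ps, n, domain_size=None, z=2.0):
    """Return (lower, upper) limits of the uncertainty region as proportions.

    ps: passing standard as a proportion; n: number of items on the test;
    domain_size: number of items in the domain (N > 1), or None if unspecified.
    The binomial SE with a finite-population correction is an assumption.
    """
    se = sqrt(ps * (1 - ps) / n)
    if domain_size is not None:
        se *= sqrt((domain_size - n) / (domain_size - 1))
    return ps - z * se, ps + z * se


def can_pass(ps, n, domain_size=None):
    """A pass is possible only if the upper limit lies below a perfect score."""
    return uncertainty_limits(ps, n, domain_size)[1] < 1.0


# Grid of the situations described for Table 1 (a subset, for brevity).
for N in (None, 100, 25):
    for n in (10, 20):
        for ps in (0.50, 0.70, 0.90):
            lo, hi = uncertainty_limits(ps, n, N)
            print(N, n, ps, round(100 * lo), round(100 * hi), can_pass(ps, n, N))
```

Under these assumptions a 30 item test from an unlimited domain with a PS of 50% gives limits of roughly 32% and 68%, a 10 item test with a PS of 90% has an upper limit above 100% so that no one can pass, and the band shrinks sharply as n approaches N.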

To summarize these findings: (a) the sampling approach does not yield 
useful decisionmaking capabilities when the PS is high (over 85% approximately); 
(b) UB's are very large when the PS is low, between 50% and 70%, suggesting 
that if instruction is less than superior or measurement error is large, 
far too many students may be found in the uncertainty band; (c) when the 
domain size is small and the test size is relatively large, UB's are 
more effective in decisionmaking than in other situations. The empirical 
question that arises is what proportion of examinees are given passing, 
uncertain, and failing assignments when real data are used. We turn to the 
second phase of this empirical aspect of the analysis. 

The data employed here are quite varied and are not representative of all 
possible DRT's. Nonetheless, the tests are DR, and some inferences may be 
validly drawn, however limited in generality they are. 

The first set of data was taken from an undergraduate measurement course 
where the tests were CR, the PS was 70%, and the instructional system was 
mastery learning. Although the tests were objective-referenced, items were 
pooled into conceptually homogeneous domains, and items were randomly sampled 
into test forms. Thus the tests were CR and DR by virtue of the defining 
characteristics of each. Tests were administered before and after instruction 
and ranged in size from 41 to 45 items. Test characteristics were reported by 
Haladyna (1974). The subscale information is omitted due to the fact that the 
SE's were extremely large relative to the variance of the subscales. Since the 
sampling approach also leads to large UB's with short scales, these data 
were omitted from further consideration. 

The second set of data conforms more loosely to the DRT definition 
previously given. The tests were of high quality, and the items were 
objective-referenced. The test forms were drawn from a pool of items 
representing the domain of dental anatomy, and the items were keyed to a five 
volume dental anatomy text. The test was administered in a number of dental 
schools as an achievement test, although the use of a PS is difficult to 
determine from school to school. Despite these limitations, the tests 
minimally meet the requirements for a DRT. 

In Table 2, the SE's and UB's for the first set of data are presented 
for one form for each of three instructional units. Also presented is the 
percentage of students falling in the categories of pass, uncertain, and 
fail for each approach. The PS actually used was 70%. If this standard 
had been applied to the students in this classroom testing situation, about 
the same number of students would be classified in each category regardless 
of the procedure. As the PS was lowered, the classical procedure proved 
to be more effective. As the PS was raised, the sampling approach was more 
effective. Both approaches resulted in far too many students being 
categorized as uncertain. Since it is assumed that any system that leads to 
uncertainty about a great number of examinees is less than useful, both 
approaches must be rejected. On the other hand, the fault may lie with the 
PS. If, for purposes of validity, increasing motivation, or decreasing 
anxiety, the more appropriate PS is 50%, the classical procedure would lead 
to a smaller confidence band and thus prove more effective. 

TABLE 2 

Classical Confidence Interval, Uncertainty Band, 
Percentages of Passes, Fails, and Uncertains for Three DRT's 

Looking at the second set of data, shown in Table 3, two 100-item forms 
of the test were used. Here, there is less tangible evidence for setting 
a PS; ideally, however, we would use some prior information for the 
establishment of a PS. Thus we can only speculate that if the PS shown in 
Table 3 were 70% for Form A, 64% of all students would be assigned to a 
doubtful or failing status under the classical procedure, while 77% would 
have a similar fate using the sampling procedure. Again, the two procedures 
result in the preponderance of examinees falling in the doubtful or failing 
ranges. Only when the PS is quite low (65%) does either procedure lead to 
seemingly efficient decisionmaking. That raises the question: For what 
reason does one set a PS? Is it to make decisionmaking more efficient, to 
ensure the valid assignment of examinees to the next step in instruction, to 
lower anxiety, to motivate? Hopefully future endeavors in decisionmaking will 
consider some of these variables in a systematic way. To be sure, the data 
presented in Tables 2 and 3 reveal that both classical and item sampling 
approaches have many serious limitations. 

Conclusions 

In many respects this analysis has revealed that there is much to be 
done in the theory of measurement with respect to decisionmaking in 
individualized systematic instruction. There is little support for either 
classical procedures or for the item sampling approach as an aid to 
decisionmaking. Neither appears to meet the criterion of effectiveness, and 
a relative comparison of the merits of these two would only lead to 
irrelevant information. 
In other words, neither approach appears to contribute importantly to 
decisionmaking in this instructional context. 

TABLE 3 

Classical Confidence Interval, Uncertainty Band, 
Percentage of Passes, Fails, and Uncertains for 
Two Parallel Forms DRT's 

This analysis does raise several crucial issues in test development 
and decisionmaking. In many respects, the discussion of DRT's has been 
reduced to instances where domains are loosely defined. It is clear from 
the writing of Hively et al. (1973) and Bormuth (1971) that much more was 
intended in DR testing. The issue that arises in classical theory and the 
DR approach is how to define a domain. In the classical approach, one 
looks for items that measure well the hypothetical domain. This procedure 
is much like the development and validation of a construct: (a) define the 
construct abstractly, (b) hypothesize measures of that construct, (c) test 
to see if observables (items) measure that construct. High interitem 
correlations are essential in establishing the content (factorial) validity 
as well as construct validity of the items and tests. Items that don't 
belong to the domain have low discrimination indexes and are discarded or 
rejected. This is similar to the case in Rasch scaling, where items either 
fit or don't fit the latent trait. In the domain-referenced approach, the 
domain is rigorously defined via item forms or item writing rules and items 
generated in conformance. It is assumed that the rigor that goes into the 
item construction procedures will yield better measures. The hypothesis 
that any procedure leads to better measures needs to be empirically tested. 

Finally, a number of procedures were very briefly described as 
approaches to decisionmaking. It would be useful to test the applicability 
of these approaches with test data. The Bayesian approach offers a 
procedure based on a threshold loss function rather than a squared-error loss 
function. The former is said to lead to smaller SE's in decisionmaking 
(Hambleton & Novick, 1973); whether it does, and to what degree, is largely 
indeterminate at present. In the Rasch model, SE's become small when an 
examinee is matched to test items, that is, when he misses 50% of the items. 
This is contrary to the principle of randomly sampling items from a domain, 
which is prominent in classical theory and item sampling theory. In Cronbach's 
theory of generalizability, test scores are used to estimate universe scores 
and are regressed depending upon the group from which the examinee came. 
Again, the question arises: what is the relative degree of error of 
classification? The problem remains to be studied. 
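
The remark about the Rasch model can be checked numerically. In that model each item contributes information P(1 - P) to the ability estimate, and the asymptotic SE of the estimate is the reciprocal square root of the summed information, so the SE is smallest when P is 0.5. The sketch below uses hypothetical item difficulties and function names to compare a matched test with a mismatched one of the same length.

```python
from math import exp, sqrt


def p_correct(ability, difficulty):
    """Rasch probability of a correct response (both arguments in logits)."""
    return 1 / (1 + exp(-(ability - difficulty)))


def ability_se(ability, difficulties):
    """Asymptotic SE of the ability estimate: 1 / sqrt(sum of P * (1 - P))."""
    info = sum(
        p_correct(ability, b) * (1 - p_correct(ability, b)) for b in difficulties
    )
    return 1 / sqrt(info)


# Twenty items matched to the examinee (P = 0.5) versus twenty items
# that are two logits too difficult (hypothetical values).
print(round(ability_se(0.0, [0.0] * 20), 3))
print(round(ability_se(0.0, [2.0] * 20), 3))
```

A randomly sampled test will generally contain items that are poorly matched to a given examinee, which is why the two principles pull in different directions.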

Finally, the problem of where to set the PS is crucial to the 
decisionmaking process, as revealed by much of the data presented in this 
analysis. In many respects, if discrimination between instructed and 
non-instructed students is desired, setting the PS at the midpoint of the 
achievement scale appears to be most justifiable. The bimodal distribution 
has the fewest examinees at the middle of the scale. In this situation, it is 
clear that the classical approach works more effectively. 